Word Image Matching as a Technique for Degraded Text Recognition
Abstract
A technique is presented that determines equivalences between word images in a passage of text. A clustering procedure is applied to group visually similar words. Initial hypotheses for the identities of words are then generated by matching the word groups to language statistics that predict the frequency with which certain words will occur. This is followed by a recognition step that assigns identifications to the images in the clusters. This paper concentrates on the clustering algorithm. A clustering technique is presented and its performance on a running text of 1062 word images is determined. It is shown that the clustering algorithm can locate groups of short function words with an accuracy better than 95 percent.
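The first two steps described above (grouping visually similar word images, then pairing the groups with word-frequency statistics) can be illustrated with a minimal sketch. The abstract does not state the distance measure, the grouping rule, or the matching rule, so everything below is an assumption made for illustration: word images are taken to be equal-sized binary rasters, similarity is the fraction of differing pixels, grouping is a greedy single pass against each cluster's first member, and hypotheses come from a simple rank-to-rank pairing. The names `pixel_distance`, `cluster_word_images`, `hypothesize_identities`, and the `threshold` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Assumption: a word image is a binary raster given as rows of 0/1 pixels.
Image = Sequence[Sequence[int]]


def pixel_distance(a: Image, b: Image) -> float:
    """Fraction of differing pixels; differently shaped images count as 1.0."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 1.0
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    differing = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return differing / total


def cluster_word_images(images: List[Image], threshold: float = 0.2) -> List[List[int]]:
    """Greedy single-pass grouping: each image joins the first cluster whose
    first member (used as the prototype) lies within `threshold`."""
    clusters: List[List[int]] = []
    for idx, img in enumerate(images):
        for members in clusters:
            if pixel_distance(images[members[0]], img) <= threshold:
                members.append(idx)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([idx])
    return clusters


def hypothesize_identities(
    clusters: List[List[int]], frequent_words: List[str]
) -> List[Tuple[str, List[int]]]:
    """Pair the largest clusters with the most frequent words, rank to rank.
    `frequent_words` must be ordered from most to least frequent."""
    ranked = sorted(clusters, key=len, reverse=True)
    return list(zip(frequent_words, ranked))


if __name__ == "__main__":
    clean = [[1, 1, 1, 0], [0, 1, 0, 0]]
    noisy = [[1, 1, 1, 0], [0, 1, 0, 1]]   # one flipped pixel out of eight
    other = [[0, 0, 0, 1], [1, 0, 1, 1]]
    groups = cluster_word_images([clean, other, noisy, clean])
    print(hypothesize_identities(groups, ["the", "of"]))
```

A rank-to-rank pairing only produces initial hypotheses. In the pipeline the abstract describes, a later recognition step assigns the final identifications to the images in each cluster.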
Similar resources
Word Image Matching in a Methodology for Degraded Text Recognition
A technique for the use of global context in text recognition is presented that determines equivalences between word images in a passage of text. Initial hypotheses for the identities of words are then generated by matching the word groups to language statistics that predict the frequency at which certain words will occur. This is followed by a recognition step and a relaxation-based control st...
Integration of Visual Inter-Word Constraints and Linguistic Knowledge in Degraded Text Recognition
Degraded text recognition is a difficult task. Given a noisy text image, a word recognizer can be applied to generate several candidates for each word image. High-level knowledge sources can then be used to select a decision from the candidate set for each word image. In this paper, we propose that visual inter-word constraints can be used to facilitate candidate selection. Visual inter-word const...
Prototype Extraction and Adaptive OCR
To maintain OCR accuracy with decreasing quality of page image composition, production, and digitization, it is essential to tune the system to each document. We propose a prototype extraction method for document-specific OCR systems. The method automatically generates training samples from unsegmented text images and the corresponding transcripts. It is tolerant of transcription errors, so a ...
Line and Word Matching in Old Documents
This paper is concerned with the problem of establishing an index based on word matching. It is assumed that the book was digitised as well as possible and that some pre-processing techniques, such as line orientation correction and some noise removal, were already applied. However, two main factors make it impossible to apply ordinary optical character recognition (OCR) techniques: the...
Rodriguez-Serrano, Perronnin: Label embedding for text recognition
The standard approach to recognizing text in images consists in first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields (CRF). This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to b...